In this course, we will learn to take advantage of the basic tools of open data science. They include:
My first thoughts when I found this course:
I have been looking for something like this.
Datacamp also seems like a nice learning platform because R is used through the browser and you can get immediate feedback.
My name is Janne Mikkonen. My GitHub repository can be found here.
I start by reading the “learning2014” data file that I have created before and exploring its dimensions and structure.
setwd("~/Documents/GitHub/IODS-project/data")
learning2014 <- read.csv(file = "learning2014.csv")
dim(learning2014)
## [1] 166 7
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
This data set includes 166 observations and 7 variables: gender, age, attitude, deep, stra, surf and points. Gender is a factor variable that separates women and men. Age is an integer variable measured in years. The variables from attitude to surf are averages of several combined variables. Attitude measures global attitude towards statistics. Deep measures the dimension of deep learning, stra strategic learning and surf surface learning. Points indicates the total number of exam points.
I continue by exploring the data graphically. We can plot an overview of all variables in the data with the help of GGally and ggplot2 packages.
library(GGally)
library(ggplot2)
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
There are more women than men in the data and most participants are in their 20s. Exam points have a rather strong positive correlation with attitude. The dimensions of deep learning, strategic learning and surface learning don’t have strong positive correlations with each other, which is a good thing because they should measure separate dimensions. Deep learning and surface learning have a relatively strong negative correlation, as could be expected. Attitude and the learning dimensions have almost normal distributions, whereas for exam points the lowest end of the scale is slightly underrepresented.
We can also look at the association between attitude and exam points by drawing a scatter plot. I will include separate regression lines for women and men to evaluate whether the associations differ by gender.
p1 <- ggplot(learning2014, aes(x = attitude, y = points, col = gender))
p2 <- p1 + geom_point()
p3 <- p2 + geom_smooth(method = "lm")
p4 <- p3 + ggtitle("Student's attitude versus exam points by gender")
p4
There seems to be a clear association between attitude and points as could already be seen from the correlations. There is not much difference between women and men.
I will continue by estimating a multiple regression model for exam points with three explanatory variables: attitude, strategic learning and surface learning. These variables had the largest correlations with exam points.
model1 <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(model1)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Attitude was the only variable to have a clear and statistically significant (at the conventional p<0.05 level) association with exam points: a one-point increase in attitude is associated with an average increase of 3.4 exam points. The p-value of p<0.001 tells us that there is a very small probability of observing an association this large if the null hypothesis is true (i.e. the relationship is zero in the population). Since the dimensions of strategic and surface learning showed no associations, I refit the model excluding them.
model2 <- lm(points ~ attitude, data = learning2014)
summary(model2)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The coefficient of attitude remains about the same as before. A one-point increase in attitude is associated with an average increase of 3.5 exam points. The intercept of the model tells us that if attitude were zero, the estimated exam score would be 11.6. Based on the multiple R-squared, we can conclude that attitude towards statistics explains around 19% of the variation in exam scores.
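To make the interpretation concrete, the fitted equation can be used to compute a predicted score by hand. This is just a sketch using the rounded coefficients from the summary above; the attitude value 3.0 is an arbitrary illustration.

```r
# Predicted exam points from the fitted equation of model2
# (coefficients copied from the summary output above)
b0 <- 11.6372
b1 <- 3.5255
attitude_value <- 3.0          # an arbitrary, illustrative attitude score
b0 + b1 * attitude_value       # ~22.2 predicted exam points
```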
I will next explore the assumptions of the linear regression model by plotting the residuals of model2 with three diagnostic plots.
par(mfrow = c(2,2))
plot(model2, which = c(1, 2, 5))
The first graph, “Residuals vs Fitted”, can be used to explore the assumption that the size of the errors does not depend on the values of the explanatory variables. Here, the residuals (dots) are plotted against the fitted values of the model. The variance of the residuals seems mostly constant across the fitted values, although it becomes slightly smaller at the largest fitted values. Moreover, there are a few cases with unexpectedly low exam scores when the fitted value is between 24 and 26.
The Q-Q plot is used for exploring the assumption that the errors are normally distributed. If the points follow the line drawn in the graph, we can conclude that the errors are normally distributed. In this case, there seem to be some small deviations at the extremes of the distribution, but overall the assumption holds reasonably well.
The “Residuals vs Leverage” plot helps to identify observations that have an unusually high impact on the model. This plot includes leverage, which indicates how much the observation’s value on the predictor variable differs from its mean, and Cook’s distance, which measures how much the exclusion of that observation would change the model results. Problematic observations have both high leverage and a high Cook’s distance. In our case, we cannot even see the dashed Cook’s distance curves in the plot, so there seems to be no problem with outliers.
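Cook’s distances can also be inspected numerically. The sketch below demonstrates the idea on the built-in mtcars data (since model2 is not reloaded here), using the common but not universal rule-of-thumb cut-off of 4/n; for the actual analysis one would pass model2 to cooks.distance() in the same way.

```r
# Sketch: flag observations whose Cook's distance exceeds 4/n
fit <- lm(mpg ~ wt, data = mtcars)   # stand-in model for illustration
cd <- cooks.distance(fit)
names(which(cd > 4 / length(cd)))    # row names of potentially influential cases
```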
To conclude, we only found evidence for an association between attitude towards statistics and exam score, whereas strategic learning and surface learning did not seem to predict performance in the exam.
I start by reading the “alc” data file that I created in the data wrangling exercise. This data file is a combination of two data sets measuring student performance in (1) mathematics and (2) Portuguese language in two Portuguese secondary-level schools.
setwd("~/Documents/GitHub/IODS-project/data")
alc <- read.csv(file = "alc.csv")
After loading the data, I continue by exploring its structure and dimensions.
str(alc)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
dim(alc)
## [1] 382 35
Our data includes 35 variables (columns) and 382 observations (rows). When creating the combined analysis data we have done the following adjustments:
In this exercise I will examine the association of sex, urban/rural residence, going out with friends and participation in extracurricular activities with the probability of high alcohol use. I hypothesize that
Next, I will explore the distributions of the outcome variable and the four chosen explanatory variables. I create a new data frame “alc5” that only includes these five variables. I start by exploring the distributions graphically.
library(tidyr); library(dplyr); library(ggplot2)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
alc5 <- select(alc, sex, address, goout, activities, high_use)
gather(alc5) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
Around half of the students attend extracurricular activities. Living in an urban area is more common than living in a rural area, and the extremes of going out with friends are less common than the middle of the scale. Girls are only slightly overrepresented in the sample. Around one third of the sample engages in high alcohol use.
I continue by looking at the relative frequencies of low/high alcohol use by the four explanatory variables.
table(alc5$activities, alc5$high_use) %>% prop.table(1) %>% round(2)
##
## FALSE TRUE
## no 0.67 0.33
## yes 0.73 0.27
table(alc5$address, alc5$high_use) %>% prop.table(1) %>% round(2)
##
## FALSE TRUE
## R 0.62 0.38
## U 0.72 0.28
table(alc5$goout, alc5$high_use) %>% prop.table(1) %>% round(2)
##
## FALSE TRUE
## 1 0.86 0.14
## 2 0.84 0.16
## 3 0.82 0.18
## 4 0.51 0.49
## 5 0.40 0.60
table(alc5$sex, alc5$high_use) %>% prop.table(1) %>% round(2)
##
## FALSE TRUE
## F 0.79 0.21
## M 0.61 0.39
Those attending extracurricular activities are slightly less likely to report high alcohol consumption, as hypothesized. Contrary to the hypothesis, consuming high amounts of alcohol is less common among those living in urban areas. In line with my hypotheses, high alcohol use is more common among students who go out with friends often and among male students.
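As an aside, the prop.table(x, 1) call used in the tables above converts a table of counts into row proportions; a minimal self-contained sketch with made-up counts:

```r
# Toy 2x2 count table (made-up numbers, for illustration only)
tab <- matrix(c(6, 3, 2, 9), nrow = 2,
              dimnames = list(group = c("no", "yes"),
                              high_use = c("FALSE", "TRUE")))
prop.table(tab, 1)   # each row now sums to 1
```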
I move on to estimating a logistic regression model with high alcohol use as the outcome variable and activities, address, going out with friends and sex as explanatory variables.
m <- glm(high_use ~ activities + address + goout + sex, data = alc5, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ activities + address + goout + sex,
## family = "binomial", data = alc5)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7579 -0.7750 -0.5065 0.7677 2.4999
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1340 0.4927 -6.361 2.00e-10 ***
## activitiesyes -0.5145 0.2532 -2.032 0.04215 *
## addressU -0.7487 0.2995 -2.500 0.01242 *
## goout 0.8028 0.1212 6.622 3.55e-11 ***
## sexM 0.9398 0.2536 3.706 0.00021 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 392.97 on 377 degrees of freedom
## AIC: 402.97
##
## Number of Fisher Scoring iterations: 4
OR <- coef(m) %>% exp %>% round(2)
CI <- confint(m) %>% exp %>% round(2)
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.04 0.02 0.11
## activitiesyes 0.60 0.36 0.98
## addressU 0.47 0.26 0.85
## goout 2.23 1.77 2.85
## sexM 2.56 1.57 4.24
All explanatory variables were statistically significantly associated with high alcohol use (p<0.05, and none of the 95% confidence intervals overlapped 1). The odds of high use for students who participated in extracurricular activities were 0.60-fold compared with students who did not participate. In other words, the probability of high use was lower among students who participated in extracurricular activities. Similarly, urban residence was associated with lower odds of high alcohol use and male sex with higher odds of high alcohol use. The frequency of going out with friends was a numeric variable in our data: a one-unit increase in the frequency of going out was associated with 2.23-fold odds of high alcohol use.
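The coefficients live on the log-odds scale, and plogis() converts a log-odds value into a probability. The sketch below uses the rounded coefficients from the summary above for a hypothetical male student in a rural area who does not attend activities and has goout = 3; treat the exact numbers as illustrative only.

```r
# Log-odds of high use for one illustrative covariate pattern
# (coefficients copied from the summary(m) output above)
log_odds <- -3.1340 + 0.8028 * 3 + 0.9398   # intercept + goout*3 + sexM
plogis(log_odds)                            # ~0.55 predicted probability
```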
Overall, our hypotheses related to extracurricular activities, going out with friends and sex were supported by the logistic regression analysis. On the other hand, I hypothesized that urban residence would be associated with a higher probability of high alcohol use, while in the analysis urban residence was in fact associated with considerably lower odds of high use.
Next, I will explore the predictive power of my logistic model where all explanatory variables were clearly associated with the odds of high alcohol use. I begin by creating a 2x2 table, which shows the numbers of correct and false predictions.
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc5 <- mutate(alc5, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc5 <- mutate(alc5, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc5$high_use, prediction = alc5$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 245 23
## TRUE 71 43
We can see that for 245 cases the model correctly predicted that they do not use high amounts of alcohol and for 43 cases that they do have high alcohol use. On the other hand, 71 cases were incorrectly classified as not-high-users while in reality they were high-users.
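From the same 2x2 table, the overall accuracy (the share of correct predictions) can be computed directly:

```r
# Overall accuracy from the confusion matrix above
correct <- 245 + 43              # correctly classified non-high and high users
total   <- 245 + 23 + 71 + 43    # all 382 students
correct / total                  # ~0.75
```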
I continue by (1) creating a graph of the actual cases and predictions of high use, (2) a proportion table of the predictions and finally (3) a function calculating the proportion of incorrect predictions.
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc5, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc5$high_use, prediction = alc5$prediction) %>% prop.table() %>% addmargins() %>% round(2)
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64 0.06 0.70
## TRUE 0.19 0.11 0.30
## Sum 0.83 0.17 1.00
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc5$high_use, prob = alc5$probability)
## [1] 0.2460733
Looking at the graph, we can see that there were quite a few cases that were predicted to have a low probability of high alcohol use but were still high users. Around 19% of predictions were incorrect in that way, while only 6% of cases were incorrectly predicted to have high use when in reality they did not. Overall, around one fourth of cases were incorrectly classified with respect to their alcohol use. This result is definitely better than tossing a coin, and actually quite good given that we only had four explanatory variables in the model. At the same time, one fourth is still a fairly large proportion of false predictions and could probably be improved by including more variables.
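The loss function defined above can be sanity-checked on a tiny made-up example in which exactly half of the predictions fall on the wrong side of the 0.5 threshold:

```r
# Mean prediction error: the share of cases where the predicted
# probability is on the wrong side of 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
toy_class <- c(1, 1, 0, 0)           # made-up true classes
toy_prob  <- c(0.9, 0.2, 0.1, 0.6)   # two of four are misclassified
loss_func(class = toy_class, prob = toy_prob)   # 0.5
```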
Using the above-defined loss function I can perform a ten-fold cross-validation on my model. In this cross-validation the analysis data is first divided into ten equal-sized parts. Each part in turn acts as the testing data while the other nine parts form the training data. The training data is used for fitting the model while the testing data is used for testing it. Based on the ten tests we can calculate the average proportion of wrong predictions.
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc5, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2460733
In the ten-fold cross-validation the prediction error was around one fourth, i.e. about the same as the prediction error of our original full-sample model. My prediction error was also about the same as in the Datacamp exercise, although I mostly used a different set of explanatory variables.
I start by loading the Boston Housing Values data set used in the analysis. I will also explore its structure and dimensions.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
This data set contains 14 variables and 506 observations. The observations are suburbs within the Boston region and the variables measure different attributes of these suburbs. I continue by looking at the distributions and correlations of the variables.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
library(tidyr)
library(corrplot)
## corrplot 0.84 loaded
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(2)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
By looking at the distributions, we can see that there are some suburbs that clearly diverge from the general pattern. For instance, in one suburb the crime rate is 89 per capita (crim) while the median in Boston is only 0.26. A similar pattern can be seen for the proportion of residential land zoned for lots over 25,000 sq.ft. (zn), the proportion of blacks (black) and the median value of owner-occupied homes (medv).
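The extreme crime-rate suburb mentioned above can be located directly; a quick sketch using the Boston data from MASS:

```r
library(MASS)
data("Boston")
max(Boston$crim)       # ~88.98, the extreme suburb's crime rate
median(Boston$crim)    # ~0.26, the Boston-wide median
which.max(Boston$crim) # row index of the outlier suburb
```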
The correlation plot indicates which variables have the strongest positive (blue) and negative (red) correlations. Some of the strongest positive correlations can be found between full-value property taxes (tax) and accessibility to radial highways (rad), as well as between nitrogen oxide concentration (nox) and the proportion of non-retail business acres (indus). Distance to five employment centres (dis) in turn has strong negative correlations with nitrogen oxides and non-retail businesses. On the other hand, the Charles River dummy variable does not show notable correlations with any other variables.
The variables need to be centered and standardized, i.e. scaled, for the classification and clustering analyses.
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
We can see that all variables have mean zero. The outlier cases can still be clearly seen. Next, I will create a categorical variable indicating the quantiles of the (scaled) crime rate. This variable replaces the original continuous variable.
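Under the hood, scale() subtracts each column’s mean and divides by its standard deviation; the manual computation below should reproduce the first scaled value of crim:

```r
library(MASS)
data("Boston")
# Manual standardization of the first crim value vs. scale()
manual <- (Boston$crim[1] - mean(Boston$crim)) / sd(Boston$crim)
scaled <- scale(Boston)[1, "crim"]
all.equal(unname(manual), unname(scaled))   # TRUE
```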
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
summary(crime)
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Everything seems to be fine with the categorical variable so I will continue by splitting the data set into train (80%) and test (20%) sets. The model is trained with the train set and prediction on new data is done with the test set.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
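As written, the split above is random and unseeded, so the exact train/test partition (and the prediction results reported below) will vary between runs. A hypothetical seeded version would fix the partition; the seed value 2017 is arbitrary.

```r
set.seed(2017)             # any fixed seed makes the split reproducible
n <- 506                   # number of rows in boston_scaled
ind <- sample(n, size = floor(n * 0.8))
length(ind)                # 404 rows go to the train set
```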
Now, I am ready to fit the linear discriminant analysis on the train set. This analysis finds the linear combination of the variables that separate the target variable classes. The categorical crime rate will serve as the target variable and all the other variables in the dataset as predictor variables.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2475248 0.2475248 0.2450495 0.2599010
##
## Group means:
## zn indus chas nox rm
## low 0.93971091 -0.9037551 -0.154216061 -0.8524253 0.46388793
## med_low -0.09512877 -0.2775628 0.003267949 -0.5394142 -0.15236518
## med_high -0.40754935 0.2325600 0.284432582 0.3928027 0.07661393
## high -0.48724019 1.0149946 -0.084848104 1.0189743 -0.39189006
## age dis rad tax ptratio
## low -0.8785045 0.8082368 -0.6924575 -0.7448635 -0.44413755
## med_low -0.3193345 0.3101668 -0.5557899 -0.4856921 -0.09632279
## med_high 0.3931425 -0.3566504 -0.3844363 -0.2697891 -0.19734943
## high 0.7959843 -0.8414701 1.6596029 1.5294129 0.80577843
## black lstat medv
## low 0.3787351 -0.76658957 0.555522002
## med_low 0.3492143 -0.13717334 -0.004763056
## med_high 0.1010690 0.06572706 0.156562428
## high -0.8916147 0.90277143 -0.711553858
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.10701570 0.780418497 -0.87187927
## indus 0.03161207 -0.262327832 0.31197954
## chas -0.06312356 -0.113448997 0.09289584
## nox 0.26133506 -0.697504226 -1.49379684
## rm -0.06669223 -0.121530860 -0.19685362
## age 0.31164820 -0.277613877 -0.08468761
## dis -0.06587278 -0.405499060 0.08158625
## rad 3.19136586 1.096457917 0.03493189
## tax -0.00604261 -0.238351641 0.54822209
## ptratio 0.15667995 -0.052300407 -0.40366861
## black -0.18810129 0.006945473 0.16289532
## lstat 0.19971318 -0.407760364 0.17451790
## medv 0.17617835 -0.494365901 -0.39829530
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9513 0.0370 0.0117
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
The plot reveals that the suburbs with a high crime rate are the most distinct with regard to our selection of predictor variables. Accessibility to radial highways is the most distinguishing feature of the high crime rate areas. The med_high and med_low classes overlap the most.
In order to examine how well our model performs in predicting crime rate, I save the correct classes of the test data as a separate object.
library(dplyr)
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
Now I can predict the classes with the LDA model on the test data and compare the predictions results with the actual categories.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 17 7 3 0
## med_low 5 17 4 0
## med_high 2 7 18 0
## high 0 0 1 21
Nearly all suburbs with a high crime rate were correctly predicted (21 of 22). This is not a surprise given that high crime rate suburbs turned out so distinctive in the original analysis. The model had more difficulties in separating med-high suburbs from med-low ones. Low and med-low were somewhat overlapping in the original analysis, and in this test too some of the low crime suburbs were classified as med-low. Despite this, most predictions were correct.
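The overall accuracy on the test set can be read directly off the cross-tabulation above; note that these counts depend on the random split, so they are illustrative:

```r
# Accuracy from the printed confusion matrix
correct <- 17 + 17 + 18 + 21   # diagonal of the table
total   <- 27 + 26 + 27 + 22   # row totals (the 102-row test set)
correct / total                # ~0.72
```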
Next, I will perform a K-means clustering analysis on the Boston data set. I will first reload and scale the data and calculate the Euclidean distances between the observations.
data("Boston")
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# euclidean distance matrix (note: dist() is applied here to the unscaled Boston data)
dist_eu <- dist(Boston)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.119 85.624 170.539 226.315 371.950 626.047
I begin the K-means clustering analysis with four clusters and visualise some of the columns.
# k-means clustering
km <-kmeans(Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston[5:10], col = km$cluster)
Four clusters may be too many, as it is difficult to distinguish them in the plot. The optimal number of clusters can be identified by looking at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes.
library(ggplot2)
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
WCSS drops sharply when moving from one cluster to two, after which the curve levels off, so I will estimate a two-cluster solution.
# k-means clustering with two centers (seed fixed for reproducibility)
set.seed(123)
km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston[1:7], col = km$cluster)
pairs(Boston[8:14], col = km$cluster)
The two-cluster solution is clearer to interpret than the original four-cluster solution. Accessibility to radial highways is once again one of the variables that define the clusters, and the full-value property-tax rate is another. These variables probably identify the central areas of Boston. Crime rate is also one of the attributes that separate the clusters.
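The interpretation above can also be checked numerically from the cluster centers. A sketch, assuming MASS is available for the Boston data; `km2` is my own name, and the seed is fixed because k-means uses random starting points.

```r
library(MASS)    # provides the Boston data
set.seed(123)
km2 <- kmeans(Boston, centers = 2)
# mean values of the variables discussed above, per cluster:
# crime rate (crim), highway accessibility (rad), property-tax rate (tax)
round(km2$centers[, c("crim", "rad", "tax")], 1)
# number of suburbs in each cluster
km2$size
```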
Our data set originates from the United Nations Development Programme. It is a combination of two separate data sets measuring human development and gender (in)equality for the same countries. I have modified the data so that it only includes 8 indicators of human development / equality, and only for those countries that did not have any missing values for these indicators. I start by reading the data and examining its structure and dimensions.
setwd("~/Documents/GitHub/IODS-project/data")
human <- read.csv(file = "human.csv", row.names = 1)
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ f2edu : num 97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
## $ flabour : num 61.2 58.8 61.8 58.7 58.5 53.6 53.1 56.3 61.6 62 ...
## $ expedu : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ le : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ gni : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ matmort : int 4 6 6 5 6 7 9 28 11 8 ...
## $ adolbirthrate: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ parliament : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155 8
We can see that there are 155 observations, i.e. countries, in the data. The 8 variables measure the proportion of women with at least secondary education (f2edu), female labour force participation (flabour), expected years of education (expedu), life expectancy (le), gross national income (gni), maternal mortality (matmort), adolescent birth rate (adolbirthrate) and female parliamentary representation (parliament).
I continue by examining the distributions of the variables and their relationships with each other.
summary(human)
## f2edu flabour expedu le
## Min. : 0.90 Min. :13.50 Min. : 5.40 Min. :49.00
## 1st Qu.: 27.15 1st Qu.:44.45 1st Qu.:11.25 1st Qu.:66.30
## Median : 56.60 Median :53.50 Median :13.50 Median :74.20
## Mean : 55.37 Mean :52.52 Mean :13.18 Mean :71.65
## 3rd Qu.: 85.15 3rd Qu.:61.90 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :100.00 Max. :88.10 Max. :20.20 Max. :83.50
## gni matmort adolbirthrate parliament
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
library(GGally)
ggpairs(human)
If we look at the summary table, we can see that all variables vary strongly between the countries. For instance, the proportion of women with at least secondary education ranges from 0.9% to 100% and female parliamentary representation from 0% to 57.5%. Some of the variables have quite strong correlations with each other. For example, life expectancy is strongly associated with higher expected years of education and a lower adolescent birth rate. On the other hand, female parliamentary representation does not correlate strongly with any of the other variables. Expected years of education is the only variable with a somewhat normal distribution. For maternal mortality, there are a few countries with very high ratios, but most countries fall to the lower end of the distribution.
Next, I will perform a principal component analysis on the data described above. I plot the results with a biplot that uses the two most important principal components as its x- and y-axes.
#The model
pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 186.2835 25.97 20.07 14.32 10.63 3.721
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.00 0.00 0.000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.00 1.00 1.000
## PC8
## Standard deviation 1.428
## Proportion of Variance 0.000
## Cumulative Proportion 1.000
biplot(pca_human, choices = 1:2, cex = c(0.5, 0.8), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
The plot is quite difficult to read, as most countries fall into its top right corner. Because we have not standardized the variables, gross national income (GNI) dominates the analysis due to its large variance; in fact, the first principal component seems to account for practically 100% of the variance. GNI defines the first principal component, but it is difficult to say what defines the second, as the other arrows are not drawn. Let's rerun the analysis with scaled variables.
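Before doing that, a self-contained toy example (with made-up data) illustrates why a single high-variance variable dominates unscaled PCA:

```r
set.seed(1)
x <- rnorm(100)             # unit variance
y <- rnorm(100, sd = 1000)  # huge variance, analogous to GNI
# without scaling, PC1 is essentially just the high-variance variable
summary(prcomp(cbind(x, y)))$importance[2, ]
# with scaling, the variance is shared between the components
summary(prcomp(cbind(x, y), scale. = TRUE))$importance[2, ]
```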
#Scale
human_std <- scale(human)
#Estimation
pca_human <- prcomp(human_std)
s <- summary(pca_human)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
# create object pc_lab to be used as axis labels
pc_lab <- paste0(c("Underdevelopment", "Gender equality"), " (", pca_pr, "%)")
biplot(pca_human, choices = 1:2, cex = c(0.5, 0.8), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
After scaling the variables, the plot becomes much clearer to interpret. The biplot arrows reveal that the first principal component measures human development or, more precisely, underdevelopment (high maternal mortality, low life expectancy, low education etc.), whereas the second component measures gender equality separately from human development (female parliamentary representation and female labour market participation). This is easy to understand, because when we looked at the correlations, parliament and female labour were the only variables that did not correlate strongly with the other variables. I have also labeled the principal components in the plot accordingly. The proportions of variance accounted for by the components are shown in parentheses. The human development component accounted for more than half of the variance. Because negative values of the human development component indicate high development, the Nordic countries (highly developed, with high gender equality) can be found in the top-left corner of the plot. Interestingly, there are quite many underdeveloped countries with high equality and vice versa.
Next, I move on to multiple correspondence analysis. Here I will use a data set that contains the answers to a questionnaire on tea consumption. I start by loading the data and exploring its structure and dimensions.
library(FactoMineR)
data("tea")
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
This tea consumption data set has 300 observations (survey participants) and 36 variables (questions). The variables include questions on background factors (sex, age etc.), tea consumption habits (frequency, during lunch, at work etc.), and tea-related notions and attitudes (healthiness, femininity, relaxation).
To improve the clarity of the analysis, I will only select a subset of ten variables.
# column names to keep in the dataset
keep_columns <- c("Tea", "sex", "where", "SPC", "friends", "healthy", "sophisticated", "frequency", "breakfast", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))
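A side note on the selection step: `one_of()` has since been superseded in dplyr by `all_of()`, which errors on missing column names instead of only warning. A minimal sketch on a toy data frame (`df` is a made-up example, not the tea data):

```r
library(dplyr)
df <- data.frame(Tea = "black", sex = "F", age = 39)
# all_of() is the stricter modern replacement for one_of()
select(df, all_of(c("Tea", "sex")))
```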
Next, I will perform a multiple correspondence analysis on these ten variables.
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.174 0.166 0.144 0.138 0.122 0.118
## % of var. 9.169 8.726 7.557 7.283 6.395 6.237
## Cumulative % of var. 9.169 17.895 25.452 32.735 39.130 45.367
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.113 0.107 0.105 0.093 0.089 0.087
## % of var. 5.945 5.658 5.551 4.919 4.699 4.604
## Cumulative % of var. 51.312 56.970 62.520 67.439 72.138 76.742
## Dim.13 Dim.14 Dim.15 Dim.16 Dim.17 Dim.18
## Variance 0.075 0.073 0.070 0.065 0.059 0.054
## % of var. 3.927 3.855 3.666 3.432 3.105 2.820
## Cumulative % of var. 80.669 84.524 88.190 91.622 94.727 97.547
## Dim.19
## Variance 0.047
## % of var. 2.453
## Cumulative % of var. 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2
## 1 | 0.295 0.167 0.044 | 0.882 1.563 0.392 |
## 2 | 0.096 0.017 0.005 | 0.732 1.078 0.281 |
## 3 | -0.058 0.007 0.002 | -0.148 0.044 0.010 |
## 4 | 0.416 0.330 0.146 | -0.076 0.011 0.005 |
## 5 | 0.003 0.000 0.000 | 0.325 0.213 0.066 |
## 6 | 0.320 0.196 0.074 | 0.045 0.004 0.001 |
## 7 | -0.105 0.021 0.005 | 0.186 0.069 0.015 |
## 8 | 0.511 0.500 0.128 | -0.101 0.021 0.005 |
## 9 | -0.179 0.062 0.016 | 0.402 0.324 0.082 |
## 10 | -0.282 0.152 0.035 | 0.920 1.703 0.378 |
## Dim.3 ctr cos2
## 1 -0.227 0.120 0.026 |
## 2 -0.400 0.371 0.084 |
## 3 -0.469 0.511 0.101 |
## 4 -0.363 0.306 0.111 |
## 5 -0.391 0.355 0.095 |
## 6 -0.470 0.512 0.158 |
## 7 0.370 0.318 0.060 |
## 8 -0.080 0.015 0.003 |
## 9 0.499 0.579 0.127 |
## 10 0.353 0.289 0.056 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.017 0.004 0.000 0.170 | 0.537 4.294
## Earl Grey | -0.186 1.281 0.063 -4.326 | -0.278 3.004
## green | 1.051 6.971 0.136 6.387 | 0.423 1.185
## F | -0.339 3.912 0.168 -7.078 | -0.247 2.191
## M | 0.494 5.707 0.168 7.078 | 0.361 3.197
## chain store | 0.183 1.236 0.060 4.229 | -0.138 0.739
## chain store+tea shop | -0.817 9.950 0.234 -8.369 | 0.180 0.509
## tea shop | 0.949 5.170 0.100 5.470 | 0.417 1.049
## employee | 0.135 0.206 0.004 1.155 | 0.070 0.059
## middle | 0.196 0.295 0.006 1.333 | 0.526 2.222
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.095 5.316 | 0.211 0.765 0.015 2.087 |
## Earl Grey 0.140 -6.462 | -0.125 0.695 0.028 -2.892 |
## green 0.022 2.569 | 0.255 0.499 0.008 1.552 |
## F 0.089 -5.168 | -0.266 2.924 0.103 -5.556 |
## M 0.089 5.168 | 0.388 4.267 0.103 5.556 |
## chain store 0.034 -3.190 | -0.461 9.490 0.378 -10.638 |
## chain store+tea shop 0.011 1.846 | 0.573 5.936 0.115 5.868 |
## tea shop 0.019 2.404 | 1.464 14.937 0.238 8.441 |
## employee 0.001 0.602 | -0.797 8.709 0.156 -6.822 |
## middle 0.043 3.565 | 0.426 1.688 0.028 2.892 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.144 0.141 0.028 |
## sex | 0.168 0.089 0.103 |
## where | 0.285 0.038 0.436 |
## SPC | 0.150 0.270 0.444 |
## friends | 0.119 0.162 0.038 |
## healthy | 0.011 0.136 0.050 |
## sophisticated | 0.032 0.049 0.033 |
## frequency | 0.405 0.391 0.184 |
## breakfast | 0.206 0.351 0.023 |
## lunch | 0.222 0.030 0.097 |
# visualize MCA
plot(mca, invisible=c("var"), habillage = "quali")
plot(mca, invisible=c("ind"), habillage = "quali")
From the summary of the model, we can see that none of the identified dimensions accounts for more than 10% of the variance in the data. The first plot shows how the individuals divide along the two dimensions, while the second plot shows how the variables relate to the dimensions. We can see that most individuals fall in the middle of the plot and there are no clear outliers. This could be related to the fact that there are no strong dimensions in the data.
When we look at the variable plot, we can see some patterns in the data. Buying tea from tea shops, consuming green tea, consuming tea often and alone, and, interestingly, being a workman are found close to each other. On the other hand, being a student, consuming tea less frequently and considering tea unhealthy are also related, as are being a senior and drinking tea in the morning. Nevertheless, there are only a few clearly identifiable patterns in the data when we include these ten variables.
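The patterns read off the variable plot can also be checked numerically: FactoMineR stores each category's contribution to the dimensions in `mca$var$contrib`. A sketch, assuming FactoMineR is installed; it refits the same ten-variable MCA so that it is self-contained.

```r
library(FactoMineR)
data(tea)
keep_columns <- c("Tea", "sex", "where", "SPC", "friends", "healthy",
                  "sophisticated", "frequency", "breakfast", "lunch")
mca <- MCA(tea[, keep_columns], graph = FALSE)
# the categories that contribute most to the first dimension
head(sort(mca$var$contrib[, 1], decreasing = TRUE), 5)
```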